
    Classification of high dimensional data using LASSO ensembles

    Urda, D., Franco, L. and Jerez, J.M. (2017). Classification of high dimensional data using LASSO ensembles. Proceedings of IEEE SSCI'17, Symposium Series on Computational Intelligence, Honolulu, Hawaii, U.S.A. ISBN: 978-1-5386-2726-6.
    The estimation of multivariable predictors with good performance in high dimensional settings is a crucial task in biomedical contexts. Usually, solutions based on the application of a single machine learning model are provided, while the use of ensemble methods is often overlooked in this area despite the well-known benefits these methods provide in terms of predictive performance. In this paper, four ensemble approaches using LASSO base learners are described to predict the vital status of a patient from RNA-Seq gene expression data. The results of the analysis carried out on a public breast invasive cancer (BRCA) dataset show that the ensemble approaches outperform, with statistical significance, the standard LASSO model considered as the baseline case. We also analyze the computational costs involved in each of the approaches, providing usage recommendations according to the available computational power.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
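    The authors' implementation is not reproduced here, but as a hedged illustration of the general idea, the sketch below bags L1-regularized (LASSO) logistic base learners over bootstrap resamples and averages their predicted probabilities; the synthetic data, n_learners and the regularization strength C are assumptions for the example, not the paper's setup.

        # Minimal sketch of a bagged ensemble of LASSO base learners
        # (illustrative; not the authors' code or hyperparameters).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        # Synthetic stand-in for a high-dimensional RNA-Seq matrix (samples x genes).
        X, y = make_classification(n_samples=200, n_features=2000,
                                   n_informative=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

        rng = np.random.default_rng(0)
        n_learners = 25
        probs = np.zeros(len(y_te))
        for _ in range(n_learners):
            idx = rng.integers(0, len(y_tr), size=len(y_tr))  # bootstrap resample
            base = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
            base.fit(X_tr[idx], y_tr[idx])
            probs += base.predict_proba(X_te)[:, 1]
        probs /= n_learners  # average the ensemble's predicted probabilities

        print("Ensemble AUC:", roc_auc_score(y_te, probs))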

    Towards a Reliable Comparison and Evaluation of Network Intrusion Detection Systems Based on Machine Learning Approaches

    Presently, we live in a hyper-connected world where millions of heterogeneous devices continuously share information in different application contexts for wellness, improved communications, digital business, etc. However, the greater the number of devices and connections, the higher the risk of security threats in this scenario. To counteract malicious behaviours and preserve essential security services, Network Intrusion Detection Systems (NIDSs) are the most widely used line of defence in communication networks. Nevertheless, there is no standard methodology to evaluate and fairly compare NIDSs. Most proposals omit crucial steps in NIDS validation, which makes their comparison hard or even impossible. This work first includes a comprehensive study of recent NIDSs based on machine learning approaches, concluding that almost none of them carry out what the authors of this paper consider mandatory steps for a reliable comparison and evaluation of NIDSs. Second, a structured methodology is proposed and assessed on the UGR'16 dataset to test its suitability for addressing network attack detection problems. The recommended guideline and steps will help the research community to fairly assess NIDSs, although defining a definitive framework is not a trivial task and, therefore, some extra effort is still needed to further improve its understandability and usability.
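    As a hedged illustration of two of the validation steps such a methodology should make explicit (a temporal train/test split and a fixed, reported random seed, with per-class metrics instead of a single accuracy figure), consider the following sketch; the data, features and classifier are invented for the example and are not the paper's methodology.

        # Sketch of a reproducible NIDS evaluation: temporal split, fixed seed,
        # per-class metrics. Illustrative only; not the proposed methodology.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.metrics import classification_report

        SEED = 42  # fixed and reported so the experiment can be repeated

        # Synthetic stand-in for time-ordered network flow features and labels.
        rng = np.random.default_rng(SEED)
        X = rng.normal(size=(5000, 10))
        y = (X[:, 0] + rng.normal(scale=0.5, size=5000) > 1).astype(int)

        # Temporal split: never shuffle future traffic into the training set.
        split = int(0.7 * len(X))
        X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

        clf = RandomForestClassifier(random_state=SEED).fit(X_tr, y_tr)
        print(classification_report(y_te, clf.predict(X_te)))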

    Design of neurocomputational systems in the field of Biomedicine

    Biomedicine is a broad area in which the public entities of each country have invested, and continue to invest, a large amount of research funding through national, European, and international projects. The scientific and technological advances of the last fifteen years have made it possible to delve into the genetic and molecular bases of diseases such as cancer, and to analyze the variability in individual patients' responses to different oncological treatments, laying the foundations of what is known today as personalized medicine. Personalized medicine can be defined as the design and application of prevention, diagnosis, and treatment strategies adapted to a scenario that integrates the genetic, clinical, histopathological, and immunohistochemical profile of each patient and pathology. Given the incidence of cancer in society, and although research has traditionally focused on diagnosis, researchers' interest in the prognosis of the disease is relatively recent, an aspect that fits the growing trend of national public health systems towards a model of personalized and predictive medicine. Prognosis can be defined as prior knowledge of an event before its possible occurrence, and it can address the susceptibility, survival, and relapse of the disease. In the literature, there are works that use neurocomputational models to predict very specific cases such as, for example, relapse in operable breast cancer, based on prognostic factors of a clinical-histopathological nature. These works show that such models outperform the statistical tools traditionally used in survival analysis by expert clinical staff. However, these models lose effectiveness when processing information from atypical tumors or morphologically indistinguishable subtypes, for which clinical and histopathological factors do not provide enough discriminatory information. The reason is the heterogeneity of cancer as a disease, for which no single characterized cause exists, and whose evolution has been shown to be determined not only by clinical but also by genetic factors. Therefore, the integration of clinical-histopathological and proteomic-genomic data provides greater predictive accuracy than models that use only one type of data, making it possible to bring personalized medicine into daily clinical practice. In this sense, expression profile data from experiments with DNA microarray platforms, miRNA microarray data, or, more recently, next-generation sequencing platforms such as RNA-Seq, provide the level of detail and complexity needed to classify atypical tumors, establishing different prognoses for patients within the same protocolized group. Analyzing data of this nature represents a real challenge for clinicians, biologists, and the rest of the scientific community in general, given the large volume of information produced by these platforms. In general, the samples resulting from experiments on these platforms are represented by a very large number of genes, on the order of thousands.
    Identifying the most significant genes that carry enough discriminatory information to allow the design of predictive models would be practically impossible without the help of computer science. This is where Bioinformatics arises, a term that refers to how information science is applied in the area of biomedicine. The overall goal pursued in this thesis is therefore to bring personalized medicine into daily clinical practice. To this end, expression profile data from some of the most relevant microarray platforms are used to develop predictive models that improve the generalization capability of the prognostic systems currently used in the clinical setting. Three partial objectives derive from the overall goal of the thesis: the first is (i) to pre-process any dataset in general, and biomedical data in particular, for subsequent analysis; the second is (ii) to analyze the main shortcomings of the current information systems of an oncology service in order to develop an oncology information system that covers all its needs; and the third is (iii) to develop new predictive models based on expression profiles obtained from a sequencing platform, with emphasis on the predictive capability of these models, their robustness, and the biological relevance of the gene signatures found. Finally, it can be concluded that the results obtained in this doctoral thesis would make it possible to offer, in the near future, personalized medicine in daily clinical practice. The predictive models based on expression profile data developed in this thesis could be integrated into the oncology information system deployed at the Hospital Universitario Virgen de la Victoria (HUVV) in Málaga, itself a product of part of the work carried out in this thesis. In addition, the proteomic-genomic information of each patient could be incorporated to take full advantage of the added benefits mentioned throughout this thesis. Moreover, thanks to all the work carried out in this thesis, the doctoral candidate has been able to deepen and acquire extensive research training in an area as broad as Bioinformatics.
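    As a toy illustration of the gene-selection step described above (not the pipeline developed in the thesis), the sketch below ranks the features of a synthetic expression matrix with a univariate F-test and keeps the top k genes before fitting a classifier; the data and k = 50 are assumptions.

        # Illustrative sketch: univariate gene ranking on a synthetic expression
        # matrix, then a simple classifier on the selected genes.
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, f_classif
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.pipeline import make_pipeline

        # Stand-in for thousands of gene expression values per sample.
        X, y = make_classification(n_samples=150, n_features=5000,
                                   n_informative=15, random_state=0)

        # Selection lives inside the pipeline so it is re-fit within each CV
        # fold and never sees the held-out samples (no selection bias).
        model = make_pipeline(SelectKBest(f_classif, k=50),
                              LogisticRegression(max_iter=1000))
        print("Mean CV accuracy:", cross_val_score(model, X, y, cv=5).mean())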

    Improving the Reliability of Network Intrusion Detection Systems Through Dataset Integration

    This work presents Reliable-NIDS (R-NIDS), a novel methodology for Machine Learning (ML) based Network Intrusion Detection Systems (NIDSs) that allows ML models to work on integrated datasets, empowering the learning process with diverse information from different datasets. We also propose a new dataset, called UNK22. It is built from three of the most well-known network datasets (UGR'16, UNSW-NB15 and NSL-KDD), each one gathered from its own network environment, with different features and classes, by using the data aggregation approach present in R-NIDS. Therefore, R-NIDS targets the design of more robust models that generalize better than traditional approaches. Following R-NIDS, in this work we build two well-known ML models for reliable predictions thanks to the meaningful information integrated in UNK22. The results show how these models benefit from the proposed approach, generalizing better when trained on UNK22 than when trained individually on the datasets composing it. Furthermore, these results are carefully analyzed with statistical tools that provide high confidence in our conclusions. Finally, the proposed solution is feasible to deploy in network production environments, something not usually taken into account in the literature.
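    The aggregation procedure itself is not reproduced here; as a hedged sketch of the general idea, the snippet below projects two toy datasets onto their shared features, maps their dataset-specific labels onto a common scheme, and concatenates the result. All column names, values and the label mapping are invented for the example and are not taken from UNK22.

        # Illustrative dataset integration: shared features + harmonized labels.
        import pandas as pd

        ds_a = pd.DataFrame({"duration": [1.2, 0.4], "bytes": [300, 120],
                             "label": ["dos", "normal"]})
        ds_b = pd.DataFrame({"duration": [2.0], "bytes": [50], "pkts": [7],
                             "label": ["benign"]})

        common = sorted((set(ds_a.columns) & set(ds_b.columns)) - {"label"})
        label_map = {"dos": "attack", "normal": "normal", "benign": "normal"}

        parts = []
        for name, ds in [("A", ds_a), ("B", ds_b)]:
            part = ds[common].copy()
            part["label"] = ds["label"].map(label_map)  # common label scheme
            part["source"] = name                       # keep provenance
            parts.append(part)

        integrated = pd.concat(parts, ignore_index=True)
        print(integrated)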

    A Clustering-Based Hybrid Support Vector Regression Model to Predict Container Volume at Seaport Sanitary Facilities

    An accurate prediction of freight volume at the sanitary facilities of seaports is a key factor in improving planning operations and resource allocation. This study proposes a hybrid approach to forecast container volume at the sanitary facilities of a seaport. The methodology consists of a three-step procedure, combining the strengths of linear and non-linear models with the capability of a clustering technique. First, a self-organizing map (SOM) is used to decompose the time series into smaller clusters that are easier to predict. Second, a seasonal autoregressive integrated moving average (SARIMA) model is applied to each cluster in order to obtain its predicted values and residuals. These values are finally used as inputs of a support vector regression (SVR) model together with the historical data of the cluster. The final prediction integrates the prediction results of all clusters. The experimental results showed that the proposed model provided accurate predictions and outperformed the other models tested. The proposed model can be used as an automatic decision-making tool by seaport management thanks to its capacity to plan resources in advance, avoiding congestion and time delays.
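    A compressed sketch of the three-step idea follows. It is an illustration under many simplifying assumptions (a synthetic weekly-seasonal series, a 2x2 SOM from the third-party minisom package, arbitrary SARIMA orders), not the authors' implementation.

        # Sketch: (1) cluster windows with a SOM, (2) fit SARIMA per cluster for
        # fitted values and residuals, (3) feed those plus history into an SVR.
        # Requires: pip install minisom statsmodels scikit-learn
        import numpy as np
        from minisom import MiniSom
        from sklearn.svm import SVR
        from statsmodels.tsa.statespace.sarimax import SARIMAX

        rng = np.random.default_rng(0)
        y = 10 + np.sin(np.arange(300) * 2 * np.pi / 7) + rng.normal(0, 0.3, 300)

        # (1) SOM over sliding windows -> one cluster id per time step.
        w = 7
        windows = np.array([y[i - w:i] for i in range(w, len(y))])
        som = MiniSom(2, 2, w, sigma=0.5, learning_rate=0.5, random_seed=0)
        som.train_random(windows, 500)
        clusters = np.array([2 * som.winner(v)[0] + som.winner(v)[1]
                             for v in windows])

        # (2) + (3) per cluster: SARIMA outputs become SVR inputs.
        target = y[w:]
        for c in np.unique(clusters):
            seg = target[clusters == c]
            if len(seg) < 30:
                continue  # too few points for a stable fit in this toy example
            fit = SARIMAX(seg, order=(1, 0, 1),
                          seasonal_order=(1, 0, 1, 7)).fit(disp=False)
            X = np.column_stack([fit.fittedvalues[:-1], fit.resid[:-1], seg[:-1]])
            svr = SVR().fit(X, seg[1:])  # one-step-ahead within the cluster
            print(f"cluster {c}: fitted SVR on {len(seg) - 1} samples")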

    Artificial Neural Networks, Sequence-to-Sequence LSTMs, and Exogenous Variables as Analytical Tools for NO2 (Air Pollution) Forecasting: A Case Study in the Bay of Algeciras (Spain)

    This study aims to produce accurate predictions of the NO2 concentrations at a specific station of a monitoring network located in the Bay of Algeciras (Spain). Artificial neural networks (ANNs) and sequence-to-sequence long short-term memory networks (LSTMs) were used to create the forecasting models. Additionally, a new prediction method was proposed, combining LSTMs using a rolling window scheme with a cross-validation procedure for time series (LSTM-CVT). Two different strategies were followed regarding the input variables: using NO2 data from the station alone, or employing NO2 and other pollutant data from any station of the network plus meteorological variables. The ANN and LSTM-CVT exogenous models used lagged datasets of different window sizes. Several feature ranking methods were used to select the top lagged variables and include them in the final exogenous datasets. Prediction horizons of t + 1, t + 4 and t + 8 were employed. The inclusion of exogenous variables enhanced model performance, especially for t + 4 (ρ ≈ 0.68 to ρ ≈ 0.74) and t + 8 (ρ ≈ 0.59 to ρ ≈ 0.66). The proposed LSTM-CVT method delivered promising results, as the best performing models per prediction horizon employed this new methodology. Additionally, for each parameter combination, it obtained lower error values than ANNs in 85% of the cases.
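    The sketch below illustrates the two ingredients that LSTM-CVT combines, rolling windows and cross-validation for time series, under simplifying assumptions: a synthetic series, a tiny Keras LSTM and sklearn's TimeSeriesSplit. It is not the authors' architecture or parameterization.

        # Illustrative rolling-window samples evaluated with TimeSeriesSplit,
        # so every training fold strictly precedes its test fold in time.
        import numpy as np
        from sklearn.model_selection import TimeSeriesSplit
        from tensorflow import keras

        rng = np.random.default_rng(0)
        series = np.sin(np.arange(500) / 10) + rng.normal(0, 0.1, 500)

        window, horizon = 24, 4  # use 24 past steps to predict t + 4
        n = len(series) - window - horizon + 1
        X = np.array([series[i:i + window] for i in range(n)])[..., None]
        y = series[window + horizon - 1:]  # value `horizon` steps ahead

        for fold, (tr, te) in enumerate(TimeSeriesSplit(n_splits=3).split(X)):
            model = keras.Sequential([keras.layers.Input((window, 1)),
                                      keras.layers.LSTM(16),
                                      keras.layers.Dense(1)])
            model.compile(optimizer="adam", loss="mse")
            model.fit(X[tr], y[tr], epochs=3, verbose=0)
            print(f"fold {fold}: test MSE =",
                  model.evaluate(X[te], y[te], verbose=0))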

    Search for new particles in events with energetic jets and large missing transverse momentum in proton-proton collisions at √s = 13 TeV

    A search is presented for new particles produced at the LHC in proton-proton collisions at √s = 13 TeV, using events with energetic jets and large missing transverse momentum. The analysis is based on a data sample corresponding to an integrated luminosity of 101 fb⁻¹, collected in 2017-2018 with the CMS detector. Machine learning techniques are used to define separate categories for events with narrow jets from initial-state radiation and events with large-radius jets consistent with a hadronic decay of a W or Z boson. A statistical combination is made with an earlier search based on a data sample of 36 fb⁻¹, collected in 2016. No significant excess of events is observed with respect to the standard model background expectation determined from control samples in data. The results are interpreted in terms of limits on the branching fraction of an invisible decay of the Higgs boson, as well as constraints on simplified models of dark matter, on first-generation scalar leptoquarks decaying to quarks and neutrinos, and on models with large extra dimensions. Several of the new limits, specifically for spin-1 dark matter mediators, pseudoscalar mediators, colored mediators, and leptoquarks, are the most restrictive to date.

    Measurement of prompt open-charm production cross sections in proton-proton collisions at √s = 13 TeV

    The production cross sections for prompt open-charm mesons in proton-proton collisions at a center-of-mass energy of 13 TeV are reported. The measurement is performed using a data sample collected by the CMS experiment corresponding to an integrated luminosity of 29 nb⁻¹. The differential production cross sections of the D*±, D±, and D⁰ (D̄⁰) mesons are presented in the transverse momentum and pseudorapidity ranges 4 < p_T < 100 GeV and |η| < 2.1, respectively. The results are compared to several theoretical calculations and to previous measurements.

    Combined searches for the production of supersymmetric top quark partners in proton-proton collisions at √s = 13 TeV

    A combination of searches for top squark pair production using proton-proton collision data at a center-of-mass energy of 13 TeV at the CERN LHC, corresponding to an integrated luminosity of 137 fb⁻¹ collected by the CMS experiment, is presented. Signatures with at least 2 jets and large missing transverse momentum are categorized into events with 0, 1, or 2 leptons. New results for regions of parameter space where the kinematical properties of top squark pair production and top quark pair production are very similar are presented. Depending on the model, the combined result excludes a top squark mass up to 1325 GeV for a massless neutralino, and a neutralino mass up to 700 GeV for a top squark mass of 1150 GeV. Top squarks with masses from 145 to 295 GeV, for neutralino masses from 0 to 100 GeV, with a mass difference between the top squark and the neutralino in a window of 30 GeV around the mass of the top quark, are excluded for the first time with CMS data. The results of these searches are also interpreted in an alternative signal model of dark matter production via a spin-0 mediator in association with a top quark pair. Upper limits are set on the cross section for mediator particle masses of up to 420 GeV.